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Abstract 

o3 ; 

+^ \ We describe a statistical method to avoid biased estimation of the content of 

i-^ ■ different particle species. We consider the case when the particle identification 

^ . information strongly depends on some kinematical variables, whose distributions 

are unknown and different for each particles species. We show that the proposed 
procedure provides properly normalized and completely data-driven estimation of 
^ ,-h the unknown distributions without any a priori assumption on their functional 

Q_i! form. Moreover, we demonstrate that the method can be generalized to any 

kinematical distribution of the particles. 

> : 

y§ 1 Introduction 

(N \ 

The estimation of the particle species content in a sample of reconstructed tracks is 
a recurrent problem in Particle Physics. To this purpose, different experimental tech- 
' niques are used to obtain information about the particle type; typical examples are 

the measurement of the particle Time-of-Flight ( ToF) from the production vertex to 
£> . a given position inside the detector, or the measurement of the particle energy loss 

K> , per unit length of the traveled path due to the interaction with the detector material 

"£j ; {dE/dx). 

The information provided by these techniques is related to the particle type but typi- 
cally depends also on the track momentum. 

It is common practice to include the particle identification (PID) information in a 
Maximum Likelihood (ML) fit in order to estimate the particle species content of the 
sample. On the other hand, the strong momentum dependence of the separation power 
between different particles may lead to strongly biased results, if not properly treated 
in the ML fit. 

To be more specific, let's consider a mixture of known particle species, for example pions 
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(it), kaons (K), protons (p), and electrons (e), and assume that the PID information 
is provided by the dE/dx measurement in a drift chamber. Using the separating power 
provided by the PID, we want to estimate the unknown fractions f n , fx, f p , and f e of 
each particle type contained in the sample by means of a ML fit. Our observables are 
the measured dE/dx response (that we will indicate as x) and the momentum of the 
particle p. We then label as tj the particle hypothesis and the conditional probability 
density function of Xi for track i, given pi and tj, will be indicated as V(xi\pi, tj). 
Finally, the likelihood function is expressed as: 

L (fj)=H( e fpfapiitj)) 

i j=ir,K,p,e 

= ri( e fpfaiPi, ^) x Hvtii) ) , (i) 

i j=n,K,p,e 

where % is the track index, with the additional condition: 

E fi = 1 - ( 2 ) 

j=n,K,p,e 

In practice, we often have poor information on the distributions of the additional 
observables {V(pi\tj) in our example). Sometimes they are completely unknown. This 
is the case, for example, of particles produced during the hadronization of B mesong^]: 
the momentum distribution of each particle type is unknown and the correct likelihood 
function as defined in ([T]) cannot be constructed. 
Note that using the conditional likelihood function: 

L{fj)=\{{ E mxiiPi, t^) (3) 

i j=ir,K,p,e 

may lead to strongly biased results, if our additional variable, the momentum in our 
example, has different distributions for different particle types since the V(pi\tj) term 
cannot be factorized in ([I]) . As discussed in [2] and shown in [5] specifically for the par- 
ticle species estimation, whenever the templates used in a multi-component fit depend 
on additional observables, it is necessary to use the complete likelihood expression, and 
explicitly include the probability distributions of all observables. In our example this 
implies that we need to include the momentum distributions of each particle type in 
the likelihood. The crucial question is how to avoid strong bias in the particle fraction 
estimation when the momentum distributions of each particle type are unknown. 
In [5] it was shown that a possible solution to this problem is to use a series expansion 
of the unknown distributions; the Fourier coefficients of the series are free parameters 
determined by the fit. 

5 This work was motivated by the effort of understanding the properties of particles produced during 
B mesons hadronization, that represents a major issue in the development of flavor tagging algorithms 
like the Same Side Kaon Tagging [T]. 
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In this paper we propose a different strategy, based on the idea that if the fit is per- 
formed in sufficiently small momentum intervals, the bias due to the use of the likeli- 
hood function ([3]) is small and it goes to zero as the momentum interval width decreases. 
In Sec. |2]we present a simple procedure to perform the fit and show that we can extract 
the unknown momentum distributions of each particle type in a completely data driven 
mode without any a priori assumption on the corresponding functional forms. 
In Sec. |3j given the results of our fitting procedure, we describe a powerful technique 
to extract the distribution of any kinematical variable for any particle type, with the 
proper normalization. Finally, in Sec. H] some applications of the method to Pythia [lj 
Monte Carlo samples simulated with the CDF detector [3] are shown. 



2 Estimation of Particle Fractions and Momentum 
Distributions 

As anticipated above, we restrict ourselves to the particles contained in small momen- 
tum intervals of equal width Ap. In each bin we estimate the particle content using a 
ML fit by observing that if Ap is sufficiently small, the bias introduced by using the 
conditional likelihood of is negligible. As will be evident in the following, we find 
more convenient to use an Extended Likelihood (EL) fit (see [B] for a definition). In 
each momentum bin m the Extended Likelihood function takes the form: 

N m M 

\og{L m ) = £ m = ^\og(^2 Nj^Vjixilm^tj)) - N m (4) 

i=0 j=l 
N m M M 

i=0 j=l 3=1 

where N m is the total number of particles in the m-th momentum bin, M is the number 
of different particle species, Nj tm is the number of particles of type j, Xi is the PID 
response for the m-th momentum bin and Vj(xi\m,tj) is the conditional probability 
density function associated to x for the particle type tj in the m-th momentum bin . 
In the rest of the paper we will simplify the notation by setting Vj(xi) = Vj(xi\m,tj). 
The EL must be maximized with respect to the free parameters A^. m . 
We notice that, given its particular form, the first derivative of our EL function can be 
evaluated analytically as: 

dC m _ g Vjjxj) _ 1 ^ 



9Nhm ^ E N Km V k {x t ) 

k=l 



The critical points of (jSJ) can be obtained using an iterative algorithm. In particular, 
starting from a first guess on our parameters, Nj m , we can define an iterative procedure, 
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a Picard iteration [7], of the form: 



4=1 E Km^) 



k=l 



where at every step n of the iteration, we estimate the TV™ that will be used as input 
values for the step n + 1. The iteration converges to the roots of (151). Similarly, we 
can write the analytical expression of the second derivative of (jSJ) and estimate the 
statistical error of our parameters from the inverse of the covariance matrix V^T™ of 

J t ,771 

the fit in the m-th momentum bin as: 



jl ' m dN im dN lm ^ ( u \ 2 • 1 J 

Our iterative procedure is equivalent to the Channel Likelihood method introduced in 
[8]. It can be easily shown that the Channel Likelihood method is a maximization of 
an EL function of the form (jSJ) using an iterative scheme similar to the one we propose. 
This approach has several advantages: no functional shape is assumed for the momen- 
tum spectra, the method, provided the Vj(x) templates, is completely data driven; all 
the fits can be performed in parallel, that is the iteration can be done in all momentum 
bins at the same time; the algorithm is very fast and stable respect to the initial guess 
on the Nj )Tn . 

Once we obtain convergence of the iterative process, we can write the fraction of each 
particle type as: 

m 

Observing that each bin content is fitted independently, the corresponding statistical 
uncertainty is: 

y m 

Finally, since we are performing the fit in each momentum bin, the arrays of Nj >m 
can be interpreted as histograms, that reasonably approximate the true momentum 
distributions of each particle type. In this way, we obtained an unbiased estimation of 
the particle composition, although the V(pi\tj) were unknown, and, at the same time, 
a reasonable estimation of the V(pi\tj) themselves. 

We test our method on parametric Monte Carlo samples. Each sample consists of 
20,000 particles. Different momentum spectra are used to generate each particle type 
and the corresponding fractions are fixed at: f n = 0.74, f K = 0.17, fp = 0.07, and 
f e = 0.02 . We divide the sample in 50 momentum bins from 0.45 GeV/c up to 5 
GeV/c and a EL fit is performed in each momentum bin. If no particle falls in a given 



2 ESTIMATION OF PARTICLE FRACTIONS AND MOMENTUM DISTRIB UTI0NS5 



| Pion bin by bin Residuals~| | Kaon bin by bin Residuals | 
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bin number bin number 



Figure 1: Mean and width of the distribution of the difference of the true and measured 
fractions in each momentum bin. 



bin, the corresponding fit is not performed for obvious reasons. We repeat the fit on 
500 parametric samples to extract the residual distributions of the estimators. No 
significant bias is observed as summarized in Tab. [TJ 

Another interesting test is the residual distribution of the estimators in each momentum 
bin for the 500 samples. Fig. [1] shows the residuals of the four estimators as a function 
of the bin number (i.e. momentum). In each plot, each point represents the mean value, 
while the error bar represents the width of the distribution of the residual. We observe 
no significant bias. Examples of the unknown momentum distributions obtained from 
the fit are shown in Fig. [2j 

A very good agreement between the true distributions (filled histograms) and the fit 
(dots) is observed for all particle types. 
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EL 


(0.2 ±2) x 10~ 4 


(2 ± 2) x 10~ 4 


(3± 1) x 10~ 4 


(5± 1) x 10" 4 



Table 1 : Mean values of the residuals of the particle fractions (integrated on all momentum 
bins) estimated in 500 parametric samples. 
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3 Extracting distributions of other quantities 

Besides estimating the momentum spectra for the different particle species, we are 
typically also interested in obtaining distributions of additional relevant kinematical 
variables. We achieve this by combining our fitting procedure with a technique known 
as sPlot [9] , which allows to estimate the composition of a mixture of several compo- 
nents and their covariance matrix by means of an Extended Maximum-Likelihood fit. 
The main requirement for the sPlot technique is that the kinematical variables to be 
plotted be uncorrelated with the discriminating variables of the fit. The output is 
represented by an histogram, called sPlot, of the relevant variables. 
In our problem, this means that the variables we want to plot must not be correlated 
with the pdf's that describe the PID response. However, any dependence of the pdf's 
on kinematical variables is frozen by our strategy to perform a separate fit in each 
momentum bin: the correlation of any variable with the momentum, if any, can be 
neglected inside a sufficiently small bin. 

For each momentum bin, our fit provides an estimate of the particle content and the 
corresponding covariance matrix. It follows that for each bin we can produce the sPlot 
of any kinematical variable. 

Suppose we want to produce the sPlotj im (y) of a given variable y for the particle type 
j, corresponding to the EL fit performed in the m-th bin. As explained in [9], we 
have to properly combine the result of the fit in the m-th bin with the corresponding 
covariance matrix to define an event by event weight called the s- weight, swij^ m (x). 
We then have to fill an histogram of y where each event i, falling in the m-th bin, is 
weighted according to the s- weight: 

M 

Yl Vji, m Vi(xi) 

sw ijjm (x) = , (10) 

k=l 

where are obtained by performing the iteration ([6]) and Vji )Tn is the covariance 

matrix ([7j) corresponding to the fit performed in the m-th bin. Applying the above 
procedure in each momentum interval, we obtain an array of sPlotj ym (y) histograms. 
We then exploit the additive property of the sPlofs to add together all the sPlotj >m (y) 
corresponding to a given variable, one for each momentum bin; the resulting sPlot 
represents the distribution of the given variable y for the whole momentum range: 

sPlotjiy) = sPlot jim (y) . (11) 

m 

Thanks to the combination of the sPlot and our strategy of performing several fits 
in small momentum bins, we are able to obtain the distribution of any kinematical 
variable of each particle type belonging to our initial sample. 
The method can be summarized in a three-step procedure as follows: 
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1 . Fit the particle species content in several momentum intervals by means of the 
Extended Likelihood method via an iterative scheme. 

2. In each momentum bin, using the results of the fits, produce the sPlofs of any 
additional variable you are interested in for all the particle species. 

3. Separately for each particle type, add the array of sPlotfs of a given variable to 
extract the distribution of the variable on the whole data sample. 

4 Monte Carlo Study of the Complete Procedure 

To test the whole procedure in a realistic case, we use a Pythia Monte Carlo [I] sample 
of B~ — > D°7T~ events processed with the CDF detector simulation plus a parametric 
simulation of dE/dx and ToF. Such a sample provides a reasonable description of all 
the kinematical variables associated with a given particle and the correlations among 
them. We then fit the composition of the particles produced in the vicinity of the B~ 
mesorj^. 

The ToF response for a particle depends on two observables: the momentum and the 
length traveled by the particle (also called arc-length). Consequently, the use of the 
ToF requires, in principle, a two-dimensional binning in momentum and arc-length. 
However, given the cylindrical geometry of the CDF detector, momentum and arc- 
length can be replaced by the momentum component in the transverse plane pr and 
the pseudorapidity rj^ with no loss of information. Tab. [2] reports the total fraction of 
each particle type and Fig. [2] shows the px distributions resulting from the fit. 
Combining the particle content and the covariance matrix estimated in each p? bin, 
we are able to produce the sPlot of other kinematical variables for the whole Monte 
Carlo sample. Some examples are shown in Figs. [SB track rj, the distance AR = 
A0 2 + Arj 2 between the particle and the B meson flight directions, the longitudinal 
component p™ 1 of the particle momentum with respect to the B meson flight direction 
(this variable is often used in the B flavor tagging algorithms). 

A very good agreement is observed between the true distributions and the estimated 
ones, even in those cases where the plotted variable is strongly correlated to the px- 
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Generated 
Estimated 


0.828 
0.829 ±0.002 


0.112 
0.111 ±0.001 


0.0557 
0.0557 ±0.0006 


0.0039 
0.0044 ±0.0003 



Table 2: Comparison between the true particle content in the Pythia Monte Carlo sample 
(Generated) and the fractions resulting from the fits (Estimated). 



6 This is a typical problem encountered during the development and the calibration of flavor tagging 
algorithms for the B meson. 

7 The pseudorapidity of a track is defined as rj — — log(tan(|)), where 6 is the polar angle. 
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Figure 3: Pythia MC rj distributions overlaid to the corresponding data sPlot. 
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Figure 4: Pythia MC AR distributions overlaid to the corresponding data sPlot. 
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5 Conclusions and outlook 

We have presented a method to estimate the distributions of kinematical variables of 
different particle species using information from PID detectors. This method is shown 
to work very well on Monte Carlo data allowing the determination of the fractional 
composition of a mixed sample of particle types with remarkable precision. We use a 
likelihood function that correctly contains the pdf s of all the relevant event observables 
and therefore avoids a common mistake leading to strongly biased estimations. This 
approach looks very promising for the determination of distributions of different types 
of particles produced, for example, in conjunction with a B meson in a completely 
data-driven mode, without making any prior assumption on their functional forms. 
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